Manipulating text and detecting patterns

Malo Jan & Luis Sattelmayer

2025-01-14

Outline

  • Yesterday: what a corpus is and how to extract text from the web
  • This morning: how to manipulate text and detect patterns in it
    • Manipulate text with string operations
    • Build and use dictionaries
    • Mix of lecture and practice

Manipulating text

Cleaning texts

  • Web scraping is often only the first step: turning scraped pages into structured information can involve a lot of cleaning, text manipulation, and parsing
  • Text collected from the web is not always “clean”. Typical tasks:
    • Removing unwanted material: e.g. ads, navigation links, HTML tags, markup characters
    • Replacing characters: e.g. accents, special characters, encoding issues
    • Segmenting unstructured texts into paragraphs, sentences, words
    • Segmenting unstructured texts by group: e.g. parliamentary speeches by speaker
    • Extracting specific information from texts to get metadata: e.g. date, author, title
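
The cleaning steps listed above can be sketched with Python's standard library. This is a minimal illustration, not the course's own code: the function names (`strip_tags`, `strip_accents`, `split_sentences`) and the sample string are made up for the example, and the sentence splitter is deliberately naive.

```python
import html
import re
import unicodedata


def strip_tags(raw):
    """Remove HTML tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    decoded = html.unescape(no_tags)  # e.g. &eacute; becomes é
    # \s also matches the non-breaking space produced by &nbsp;
    return re.sub(r"\s+", " ", decoded).strip()


def strip_accents(text):
    """Replace accented letters with their unaccented base letter."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Invented sample, standing in for a scraped HTML fragment.
raw = "<p>Le d&eacute;bat est ouvert.&nbsp;Qui demande la parole ?</p>"
clean = strip_tags(raw)
sentences = split_sentences(clean)
unaccented = strip_accents(clean)
```

Real pages usually need more care: abbreviations such as "M." break this kind of sentence splitter, and a dedicated HTML parser is safer than a regex for nested markup.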

Example 1: parliamentary speeches

# A tibble: 5 × 4
  speaker            gender role               text                             
  <chr>              <chr>  <chr>              <chr>                            
1 President          Male   Chair              Je suis saisi de deux amendement…
2 Barbara Pompili    Female Secretary of State Cet amendement vise à rétablir, …
3 Geneviève Gaillard Female Rapporteur         Il est identique à l’amendement …
4 Daniel Fasquelle   Male   Deputy             Pour une fois, je soutiens le Go…
5 Philippe Plisson   Male   Deputy             Les amendements viennent d’être …

Example 2: press releases



text                              date        location  title
The Progressive Future Alliance…  2025-01-09  London    A Brighter Tomorrow: Our vision for 2030

Example 3: applause in parliamentary speeches
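
A minimal sketch of this kind of extraction, assuming a transcript where each turn starts with a speaker label followed by a colon and where reactions such as applause appear in parentheses. The transcript, the speaker labels, and the field names are all invented for illustration; real parliamentary records follow their own conventions.

```python
import re

# Invented transcript: "Speaker: speech", with reactions in parentheses.
transcript = (
    "President: The floor is open. "
    "Deputy A: I support this amendment. (Applause from the left benches.) "
    "Deputy B: I do not. (Protests.) We will vote against it. (Applause.)"
)

# Splitting on a capturing group keeps the speaker labels in the result:
# ['', 'President', '<speech>', 'Deputy A', '<speech>', ...]
parts = re.split(r"(President|Deputy [A-Z]): ", transcript)

rows = []
for speaker, speech in zip(parts[1::2], parts[2::2]):
    # Keep only the parenthesised reactions that start with "Applause".
    applause = re.findall(r"\((Applause[^)]*)\)", speech)
    # Drop every parenthesised reaction to recover the spoken text.
    text = re.sub(r"\s*\([^)]*\)", "", speech).strip()
    rows.append({"speaker": speaker, "text": text, "n_applause": len(applause)})
```

Each element of `rows` then corresponds to one speech, with the applause count stored as metadata rather than left inside the text.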

Regular expressions

  • Detecting patterns in text is a core skill for cleaning it
  • For this, we use string matching and regular expressions
  • Regular expressions (regex) are a powerful tool for detecting patterns in unstructured text
  • RegExr: an online tool for building and testing regular expressions

Regular expressions

  • A small language for describing text patterns, powerful and available in most programming languages
  • The full syntax is complex, but a few basic concepts are enough to get started
  • Many online resources exist for learning and testing regex
  • ChatGPT can now help you find the right regular expression

Regex syntax
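
A few basic building blocks, sketched with Python's `re` module. The sample string is invented (the e-mail address uses the reserved example.org domain), and the patterns are deliberately simple rather than robust.

```python
import re

# Invented sample text.
text = "Press release, London, 2025-01-09: climate action now! Contact: press@example.org"

# \d{4} = exactly four digits; parentheses capture year, month and day.
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

# [A-Z] = one uppercase letter; [a-z]+ = one or more lowercase letters;
# \b = word boundary, so only whole capitalised words match.
capitalised = re.findall(r"\b[A-Z][a-z]+\b", text)

# | = alternation ("or"); IGNORECASE makes the match case-insensitive.
topic = re.findall(r"climate|environment", text, flags=re.IGNORECASE)

# ^ anchors the pattern at the start of the string.
starts_with_press = bool(re.search(r"^Press", text))

# \S+ = one or more non-space characters: a crude e-mail pattern.
email = re.findall(r"\S+@\S+", text)
```

Here `date.groups()` gives the three captured parts of the date, and `capitalised` contains only whole words that start with an uppercase letter.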

Dictionary methods

Keyword-based methods

  • One of the most common and “easiest” tasks in quantitative text analysis is detecting the presence of specific patterns in a text with rule-based methods
  • To measure the prevalence of a concept, we can use dictionaries
  • We are all used to searching the web with keywords, but using keywords for measurement raises different issues
  • Two main uses of keywords in text analysis:
    • To measure the prevalence of a concept in a corpus based on the presence of specific words
    • To collect data by selecting the texts of interest from a larger corpus: e.g. Factiva

Dictionaries

  • A dictionary is a list of words or expressions used to detect the presence of a concept in a text
  • Binary dictionaries: presence/absence of a word in a text
  • Frequency dictionaries: count the number of times a word appears in a text
  • Weighted dictionaries: assign a weight to each word in the dictionary, e.g. for sentiment analysis
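
The three dictionary types can be compared on a single text. This is a sketch under stated assumptions: the word lists, the weights, and the sample sentence are invented for illustration and are not validated dictionaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


# Invented mini-dictionaries.
climate_terms = {"climate", "emissions", "warming"}
sentiment_weights = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}

text = "Great news: emissions fell. Bad news: warming continues, and emissions may rise."
tokens = tokenize(text)
counts = Counter(tokens)

# Binary: does at least one dictionary word appear?
binary = any(t in climate_terms for t in tokens)

# Frequency: how many times do dictionary words appear in total?
frequency = sum(counts[t] for t in climate_terms)

# Weighted: sum the weights of the matched words (0 for unknown words).
weighted = sum(sentiment_weights.get(t, 0.0) for t in tokens)
```

In practice the frequency is often divided by the number of tokens so that long and short texts can be compared.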

Keyword-based methods

Example of a dictionary: proportion of UN Secretary-General (UNSG) press releases containing the word “Palestine” or “Israel”
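
A measure of this kind can be sketched as follows. The four titles are invented placeholders, not actual UNSG press releases, and the pattern is one possible choice among many.

```python
import re

# Invented document titles standing in for a corpus of press releases.
docs = [
    "Statement on the situation in Israel and Palestine",
    "Remarks on climate finance",
    "Readout of the meeting with the Prime Minister of Israel",
    "Message on World Health Day",
]

# \b restricts the match to whole words; IGNORECASE ignores capitalisation.
pattern = re.compile(r"\b(palestine|israel)\b", re.IGNORECASE)

# One True/False flag per document, then the share of matching documents.
matches = [bool(pattern.search(d)) for d in docs]
proportion = sum(matches) / len(docs)
```

With whole-word matching, forms such as “Israeli” or “Palestinian” are not counted; whether they should be is a measurement decision for the dictionary's author.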

Advantages of dictionary methods

  • Straightforward to implement
  • Fast: no model to train, so the prevalence of a concept can be measured quickly
  • Easy to understand
  • Easy to revise: words can be added to or removed from the dictionary to refine it

Limits of dictionaries: example on climate change

  • Let’s say we want to measure the prevalence of texts related to climate change in a corpus
  • What words would you include in your dictionary?

Limits of dictionaries: measurement issues

  • It is difficult to measure the prevalence of a concept accurately with a dictionary
  • Finding all the relevant words to include in a dictionary is very hard

Although humans perform very poorly in the task of recalling large numbers of words from memory, they excel at recognizing whether any given word is an appropriate representation of a given concept. (King, Lam, and Roberts 2017, The Unreliability of Human Keyword Selection)

That is, two human users familiar with the subject area, given the same task, usually select keyword lists that overlap very little, and the list from each is a very small subset of those they would each recognize as useful after the fact. The unreliability is exacerbated by the fact that users may not even be aware of many of the keywords that could be used to select a set of documents. And attempting to find keywords by reading large numbers of documents is likely to be logistically infeasible in a reasonable amount of time.

King, Lam, and Roberts (2017) experiment

  • Ask 43 undergraduates to recall keywords for a sample of Twitter posts containing “healthcare” from the period surrounding a Supreme Court decision on Obamacare. The goal: provide a list of keywords that selects posts related to Obamacare and does not select posts unrelated to Obamacare.
  • Repeat the experiment with the Boston Marathon bombing

King, Lam, and Roberts (2017) experiment

  • Median number of words recalled: 8 for Obamacare and 7 for the Boston Marathon bombing

Consequences for statistical bias

  • Different keywords can lead to different document sets, and therefore to different inferences and conclusions

Limits of dictionaries

  • Hard to find the most relevant words to include in the dictionary
  • No knowledge of all the texts
  • Semantics: no context at all, since we do not know in which context the words are used (along with which other words)
  • Hard to build a dictionary that is exhaustive
  • Simple and conservative dictionary:
    • Matched words are likely to be relevant, but some relevant texts will be missed
  • More flexible dictionary:
    • Will match more relevant texts, but will also match more irrelevant texts
  • Scunthorpe problem:
    • Words that are not relevant but are matched by the dictionary
    • Spam filter: people from Scunthorpe could not register on a web portal because the town’s name contains the string “cunt”
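
The trade-off between flexible and conservative matching can be seen with a single keyword. The keyword “heat” and the four sentences are invented for illustration, echoing the climate change example.

```python
import re

keyword = "heat"

# Invented sentences.
docs = [
    "A record heat wave hit the region.",      # relevant
    "Wheat prices rose sharply this year.",    # irrelevant, contains "heat"
    "The theatre reopened after renovation.",  # irrelevant, contains "heat"
    "Heatwaves are becoming more frequent.",   # relevant, "heat" is part of a longer word
]

# Flexible: match the keyword anywhere, even inside other words.
substring = re.compile(re.escape(keyword), re.IGNORECASE)

# Conservative: match the keyword only as a whole word.
whole_word = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)

substring_hits = [bool(substring.search(d)) for d in docs]
whole_word_hits = [bool(whole_word.search(d)) for d in docs]
```

The substring pattern matches all four sentences, including the two irrelevant ones (the Scunthorpe problem), while the whole-word pattern avoids those false positives but misses “Heatwaves”.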

How to build dictionaries

  • Domain and context knowledge
  • Reading a sample of texts
  • Building with ChatGPT
  • Word embeddings

References

King, Gary, Patrick Lam, and Margaret E Roberts. 2017. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text.” American Journal of Political Science 61 (4): 971–88.